Movie Rating Model and Predictor
Part 1: Data
The data comprised audience and critics’ opinions, awards, studio, and actor information from Rotten Tomatoes, IMDB, and BoxOfficeMojo.com for a random sample of 651 movies produced and released prior to 2016.
Data Sources
Rotten Tomatoes
Launched in August 1998 by Senh Duong, Rotten Tomatoes is an American review aggregation website for film and television.
IMDB
Generalizability
Data Preprocessing
As a first step, variables such as website addresses and film titles were removed from the data set, and some 619 complete cases were extracted for downstream analysis. Since the focus was on theatrical releases, TV movies were excised from the data set. Next, three additional data sets were created:
* Film Box Office
* Director Profiles
* Actor Profiles
Film Box Office: The overarching aim was to understand which factors most influenced box office success for a film. Since box office revenue was not among the features provided in the data set, the first preprocessing step was to identify (or create) a response variable that would correlate with box office success. To that end, total box office revenue was obtained from BoxOfficeMojo.com for a random sample of some 230 films from the original sample. The first step in the bivariate analysis was then to identify which of the provided variables (or variables derived from them) correlated most strongly with box office success.
Director / Actor Profiles: The purpose of the director and actor profiles was to capture the experience and popularity of each director and actor. The experience variable was simply the number of films in the original sample in which the director or actor was listed. Popularity, in the director’s case, was the sum of the IMDB votes for the director’s films. Similarly, the popularity of an actor was the sum of the allocated IMDB votes for films in which the actor was listed among the top five cast members. IMDB votes were allocated to actors as follows:
* 40% of total film IMDB votes for actor1
* 30% of total film IMDB votes for actor2
* 15% of total film IMDB votes for actor3
* 10% of total film IMDB votes for actor4
* 5% of total film IMDB votes for actor5
These variables were then merged into the main sample data set.
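The profile construction described above can be sketched as follows. This is a minimal illustration in Python (standard library only), not the code used for the report: the record layout (`director`, `imdb_num_votes`, `actors`), the function names, and the two toy films are assumptions made for the example. It also assumes that a film's cast-level variables are the sums of its actors' profile values, which is one reading of the descriptions in Table 1.

```python
from collections import defaultdict

# Share of a film's IMDB votes credited to actor1..actor5, as listed above.
VOTE_SHARES = (0.40, 0.30, 0.15, 0.10, 0.05)


def build_profiles(films):
    """Return per-director and per-actor experience and popularity tallies."""
    director_experience = defaultdict(int)   # films in sample per director
    director_votes = defaultdict(float)      # summed IMDB votes per director
    actor_experience = defaultdict(int)      # films in sample per actor
    actor_votes = defaultdict(float)         # summed allocated votes per actor
    for film in films:
        director_experience[film["director"]] += 1
        director_votes[film["director"]] += film["imdb_num_votes"]
        # zip pairs actor1..actor5 with their vote shares in billing order.
        for share, actor in zip(VOTE_SHARES, film["actors"]):
            actor_experience[actor] += 1
            actor_votes[actor] += share * film["imdb_num_votes"]
    return director_experience, director_votes, actor_experience, actor_votes


def cast_features(film, actor_experience, actor_votes):
    """Sum the actor-level profiles over a film's cast (assumed definition)."""
    cast_experience = sum(actor_experience[a] for a in film["actors"])
    cast_votes = sum(actor_votes[a] for a in film["actors"])
    return cast_experience, cast_votes


# Hypothetical toy data: two films sharing a director and one lead actor.
films = [
    {"director": "D1", "imdb_num_votes": 1000,
     "actors": ["a", "b", "c", "d", "e"]},
    {"director": "D1", "imdb_num_votes": 200,
     "actors": ["a", "f", "g", "h", "i"]},
]
dir_exp, dir_votes, act_exp, act_votes = build_profiles(films)
first_cast_exp, first_cast_votes = cast_features(films[0], act_exp, act_votes)
```

Under this reading, a film whose five actors each appear only once in the sample has a cast experience of 5 and cast votes equal to its own IMDB vote count.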
Table 1: Variables added to the data set
| Variable | Description |
|---|---|
| box_office | Box office revenue from BoxOfficeMojo.com |
| box_office_log | Log of Box office revenue from BoxOfficeMojo.com |
| cast_experience | The sum across all cast members for a film, of the number of films in which each actor appeared |
| cast_experience_log | Log of the sum across all cast members for a film, of the number of films in which each actor appeared |
| cast_votes | Total number of allocated IMDB votes for the cast of a film |
| cast_votes_log | Log of the total number of allocated IMDB votes for the cast of a film |
| director_experience | Total number of films in sample for a director |
| director_experience_log | Log total number of films in sample for a director |
| imdb_num_votes_log | Log number of IMDB votes |
| runtime_log | Log runtime of movie (in minutes) |
| scores | 10 * IMDB Rating + critics score + audience_score |
| scores_log | Log(10 * IMDB Rating + critics score + audience_score) |
| thtr_days | Number of days between theatre and dvd release |
| thtr_days_log | Log number of days between theatre and dvd release |
| thtr_rel_season | Season the movie was released in theaters |
Table 2: Variables removed from the data set
| Variable | Description | Rationale |
|---|---|---|
| actor1 | First main actor/actress in the abridged cast of the movie | Not predictive without other data |
| actor2 | Second main actor/actress in the abridged cast of the movie | Not predictive without other data |
| actor3 | Third main actor/actress in the abridged cast of the movie | Not predictive without other data |
| actor4 | Fourth main actor/actress in the abridged cast of the movie | Not predictive without other data |
| actor5 | Fifth main actor/actress in the abridged cast of the movie | Not predictive without other data |
| audience_rating | Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright) | Redundant with audience_score |
| critics_rating | Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten) | Redundant with critics_score |
| director | Director of the movie | Not predictive without other data |
| dvd_rel_day | Day of the month the movie is released on DVD | No predictive value |
| dvd_rel_month | Month the movie is released on DVD | No predictive value |
| dvd_rel_year | Year the movie is released on DVD | No predictive value |
| imdb_url | Link to IMDB page for the movie | No predictive value |
| rt_url | Link to Rotten Tomatoes page for the movie | No predictive value |
| studio | The studio that produced the film | Not a variable that Paramount can change |
| thtr_rel_day | Day of the month the movie is released in theaters | No predictive value |
| thtr_rel_year | Year the movie is released in theaters | No predictive value |
| title | Title of movie | No predictive value |
| title_type | Type of movie (Documentary, Feature Film, TV Movie) | Redundant with genre |
| top200_box | Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes) | Redundant with box office success. |
The full resultant codebook for the data set can be found in Appendix A.
Part 2: Research question
The underlying intent of this analysis was to determine the factors that most influence box office success for a film. Since box office revenue was not among the variables included in the raw data set, the first task was to determine which of the selected (or derived) variables would stand as a proxy for box office success. As such the first research question is concretely stated as follows:
> Which of the selected or derived variables is most highly associated / correlated with total lifetime box office revenue?
Once this proxy response variable was determined, the features most highly associated / correlated with it were examined via the following research question:
> Which features are most highly associated / correlated with the proxy response for box office success?
Part 3: Exploratory data analysis
The exploratory data analysis comprised both a univariate and bivariate examination of the variables.
Univariate Analysis
Univariate Analysis of Categorical Variables
The purpose of the univariate analysis of categorical variables was to examine the relative frequencies and proportions of observations for each level of each categorical variable. The categorical variables included at this stage of the analysis are indicated in Table 3.
Table 3: Categorical Variables
| Variable | Description |
|---|---|
| best_actor_win | Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie |
| best_actress_win | Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actress won an Oscar for her role in the given movie |
| best_dir_win | Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie |
| best_pic_nom | Whether or not the movie was nominated for a best picture Oscar (no, yes) |
| best_pic_win | Whether or not the movie won a best picture Oscar (no, yes) |
| genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other) |
| mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, Unrated) |
| thtr_rel_month | Month the movie is released in theaters |
| thtr_rel_season | Season the movie was released in theaters |
For brevity, this section covers the five variables most important to the modeling effort, in priority order. The complete analysis can be found in Appendix B.
Director
The work of 503 directors was included in the sample provided for this project. Data with respect to the number of films in the sample per director were captured in the director experience variable defined in the quantitative section.
Genre
The drama genre represented a plurality of the releases in the sample, followed by comedy, action & adventure, and mystery & suspense. The top four genres accounted for nearly 80% of the films in the sample.
Figure 1: Films by genre
MPAA Rating
R-rated films accounted for over 50% of the releases, followed by PG and PG-13. Collectively, R, PG, and PG-13 rated films represented 90% of the films in the sample. NC-17 films were excluded from this analysis.
Figure 2: Films by MPAA Rating
Best Picture
The best picture nomination variable proved to be among the top five most influential categorical variables. Typically, variables with such imbalance between levels would be considered for exclusion, as the imbalance might bias the linear regression slopes. The decision in this case was to assume that, with random sampling, these ratios reflected the true population proportions, and to keep the variables for further analysis during the modeling stage.
Figure 3: Best picture nominations and wins
Month of Theatrical Release
Though a plurality of the films in the sample (32%) were released during the months of January, June, October, and December, the distribution of theatrical release months appeared fairly balanced within the sample.
Figure 4: Theatrical releases by month
Univariate Analysis of Quantitative Variables
The primary aim of this analysis was to examine the distribution of each variable vis-a-vis a normal distribution, and to identify potential outliers. Summary statistics, histograms, boxplots, and normal quantile-quantile (QQ) plots were rendered for each variable. The quantitative variables included at this stage of the analysis are indicated in Table 4.
Table 4: Quantitative Variables
| Variable | Description |
|---|---|
| audience_score | Audience score on Rotten Tomatoes |
| box_office | Box office revenue from BoxOfficeMojo.com |
| box_office_log | Log of Box office revenue from BoxOfficeMojo.com |
| cast_experience | The sum across all cast members for a film, of the number of films in which each actor appeared |
| cast_experience_log | Log of the sum across all cast members for a film, of the number of films in which each actor appeared |
| cast_votes | Total number of allocated IMDB votes for the cast of a film |
| cast_votes_log | Log of the total number of allocated IMDB votes for the cast of a film |
| critics_score | Critics score on Rotten Tomatoes |
| director_experience | Total number of films in sample for a director |
| director_experience_log | Log total number of films in sample for a director |
| imdb_num_votes | Number of votes on IMDB |
| imdb_num_votes_log | Log number of IMDB votes |
| imdb_rating | Rating on IMDB |
| runtime | Runtime of movie (in minutes) |
| runtime_log | Log runtime of movie (in minutes) |
| scores | 10 * IMDB Rating + critics score + audience_score |
| scores_log | Log(10 * IMDB Rating + critics score + audience_score) |
| thtr_days | Number of days between theatre and dvd release |
| thtr_days_log | Log number of days between theatre and dvd release |
Again, the five most influential quantitative variables are covered here in some detail. The complete analysis can be found in Appendix B.
Director Experience
This derived variable measured the relative experience of a given director and was defined, for each film, as the number of films in the sample credited to that film’s director.
Table 5: Director experience summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NAs | SD | CV (%) | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 1 | 1 | 1 | 1.5 | 2 | 1 | 4 | 0 | 0.75 | 51.4 | 2.55 | 1.73 |
Figure 5: Director experience histogram and QQ Plot
Figure 6: Director experience boxplot
Central Tendency: Per the summary statistics (Table 5), the median and mean director experience were 1 film and 1.5 films, respectively.
Dispersion: The standard deviation, s = 0.75, corresponds to a coefficient of variation of 51.4%, indicating a high degree of dispersion.
Shape of Distribution: The sample skewness (1.73) indicated that the distribution of director experience was right-skewed. The sample kurtosis (2.55) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 5 reveal a distribution that departs significantly from normality.
Outliers: The boxplot in Figure 6, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 1, 2, and 1, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [0, 3.5], which confirmed the existence of 20 outliers. Given the proximity of the outliers to the 1.5×IQR bounds, no effort was made to remove them.
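The ‘acceptable’ range used in the outlier paragraphs follows the usual boxplot rule: anything below Q1 − 1.5·IQR or above Q3 + 1.5·IQR is flagged. A minimal Python sketch is given below; the function names and the optional `floor` argument are assumptions introduced here to mirror how the report shows a lower bound of 0 for some non-negative variables, and the toy list of values is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5, floor=None):
    """Return (lower, upper) bounds of the k x IQR 'acceptable' range."""
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    if floor is not None:
        # Optionally clip the lower bound, e.g. at 0 for counts.
        lower = max(lower, floor)
    return lower, upper


def count_outliers(values, lower, upper):
    """Count observations falling outside the closed range [lower, upper]."""
    return sum(1 for x in values if x < lower or x > upper)


# Director experience quartiles from Table 5: Q1 = 1, Q3 = 2.
director_range = tukey_fences(1, 2, floor=0)

# Hypothetical toy values: only the observation of 4 lies above 3.5.
toy_outliers = count_outliers([1, 1, 1, 2, 2, 3, 4], *director_range)
```

With the Table 5 quartiles, the unclipped lower bound is −0.5, so clipping at 0 reproduces the reported range of [0, 3.5].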
Cast Experience
This derived variable measured the relative experience of a given cast and was defined, for each film, as the sum across its cast members of the number of films in the sample in which each actor appeared.
Table 6: Cast experience summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NAs | SD | CV (%) | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 5 | 6 | 7 | 7.5 | 9 | 3 | 15 | 0 | 2.16 | 29 | 0.32 | 0.89 |
Figure 7: Cast experience histogram and QQ Plot
Figure 8: Cast experience boxplot
Central Tendency: Per the summary statistics (Table 6), the median and mean cast experience were 7 films and 7.5 films, respectively.
Dispersion: The standard deviation, s = 2.16, corresponds to a coefficient of variation of 29%, indicating a moderate degree of dispersion.
Shape of Distribution: The sample skewness (0.89) indicated that the distribution of cast experience was right-skewed. The sample kurtosis (0.32) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 7 reveal a distribution that departs significantly from normality.
Outliers: The boxplot in Figure 8, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 6, 9, and 3, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [1.5, 13.5], which confirmed the existence of 7 outliers. Given the proximity of the outliers to the 1.5×IQR bounds, no effort was made to remove them.
Number of IMDB Votes
This variable captured the number of IMDB votes cast for each film.
Table 7: IMDB votes summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NAs | SD | CV (%) | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 183 | 5035 | 16511 | 60193.3 | 62773 | 57738 | 893008 | 0 | 114459.8 | 190.2 | 19.12 | 3.96 |
Figure 9: IMDB votes histogram and QQ Plot
Figure 10: IMDB votes boxplot
Central Tendency: Per the summary statistics (Table 7), the median and mean number of IMDB votes were 16,511 and 60,193.3, respectively.
Dispersion: The standard deviation, s = 114,459.75, corresponds to a coefficient of variation of 190.2%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (3.96) indicated that the distribution of IMDB votes was right-skewed. The sample kurtosis (19.12) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 9 reveal a distribution that departs significantly from normality.
Outliers: The boxplot in Figure 10, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 5,035, 62,773, and 57,738, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [0, 149,380], which confirmed the existence of 67 outliers.
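The dispersion and shape measures quoted for each variable can be sketched in Python as below. This is an illustration under assumptions rather than the report's own computation: it uses plain moment-based skewness and excess kurtosis (normal distribution = 0, which is consistent with the negative kurtosis values that appear in some of the tables), whereas the software behind the report may apply a small-sample adjustment. The function names and toy samples are hypothetical.

```python
import statistics


def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * sd / mean


def describe_shape(values):
    """Return CV (%), moment skewness, and excess kurtosis of a sample."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator
    # Central moments with an n denominator (no bias adjustment).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "cv": coefficient_of_variation(sd, mean),
        "skewness": m3 / m2 ** 1.5,          # > 0: right-skewed
        "excess_kurtosis": m4 / m2 ** 2 - 3,  # > 0: heavy-tailed
    }


# Check against Table 7 (IMDB votes): SD = 114,459.75, mean = 60,193.3.
imdb_votes_cv = round(coefficient_of_variation(114459.75, 60193.3), 1)

symmetric = describe_shape([1, 2, 3, 4, 5])     # hypothetical toy sample
right_skewed = describe_shape([1, 1, 1, 2, 4])  # hypothetical toy sample
```

The first check reproduces the 190.2% coefficient of variation reported for IMDB votes from the rounded SD and mean in Table 7.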
Log Number of IMDB Votes
This was a log transformation of the IMDB votes variable.
Table 8: Log IMDB votes summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NAs | SD | CV (%) | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 7.5 | 12.3 | 14 | 14.1 | 15.9 | 3.6 | 19.8 | 0 | 2.35 | 16.6 | -0.56 | 0.03 |
Figure 11: Log IMDB votes histogram and QQ Plot
Figure 12: Log IMDB votes boxplot
Central Tendency: Per the summary statistics (Table 8), the median and mean log IMDB votes were 14 and 14.1, respectively.
Dispersion: The standard deviation, s = 2.35, corresponds to a coefficient of variation of 16.6%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (0.03) indicated that the distribution of log IMDB votes was approximately symmetric. The sample kurtosis (-0.56) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 11 reveal a nearly normal distribution.
Outliers: The boxplot in Figure 12, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The first quartile, third quartile, and IQR were 12.3, 15.9, and 3.6, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [6.9, 21.3], which confirmed the absence of outliers.
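The report does not state the base of the log transformation. As a hedged arithmetic check, the sketch below compares the extremes of the raw IMDB votes (Table 7) with those of the transformed variable (Table 8); the values appear consistent with a base-2 logarithm, but that is an inference from the rounded summary statistics, not something the source confirms. The function name is an assumption.

```python
import math


def log_transform(value, base=2):
    """Log-transform a strictly positive value (base 2 assumed here)."""
    return math.log(value, base)


# Raw minimum and maximum of IMDB votes from Table 7.
raw_min, raw_max = 183, 893008

base2 = (round(math.log2(raw_min), 1), round(math.log2(raw_max), 1))
natural = (round(math.log(raw_min), 1), round(math.log(raw_max), 1))
# Table 8 reports a minimum of 7.5 and a maximum of 19.8 for the log variable.
```

Base 2 gives (7.5, 19.8), matching Table 8, while the natural log gives (5.2, 13.7), which does not match.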
IMDB Ratings
This variable captured the IMDB rating for each film.
Table 9: IMDB rating summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NAs | SD | CV (%) | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 1.9 | 5.9 | 6.6 | 6.5 | 7.3 | 1.4 | 9 | 0 | 1.08 | 16.6 | 1.31 | -0.89 |
Figure 13: IMDB rating histogram and QQ Plot
Figure 14: IMDB rating boxplot
Central Tendency: Per the summary statistics (Table 9), the median and mean IMDB rating were 6.6 points and 6.5 points, respectively.
Dispersion: The standard deviation, s = 1.08, corresponds to a coefficient of variation of 16.6%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (-0.89) indicated that the distribution of IMDB rating was left-skewed. The sample kurtosis (1.31) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 13 reveal a nearly normal distribution.
Outliers: The boxplot in Figure 14, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 5.9, 7.3, and 1.4, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [3.8, 9.4], which confirmed the existence of 18 outliers.
Critics Scores
This variable captured the critics score for each film.
Table 10: Critics score summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NAs | SD | CV (%) | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 1 | 33 | 61 | 57.4 | 83 | 50 | 100 | 0 | 28.37 | 49.4 | -1.17 | -0.26 |
Figure 15: Critics score histogram and QQ Plot
Figure 16: Critics score boxplot
Central Tendency: Per the summary statistics (Table 10), the median and mean critics score were 61 points and 57.4 points, respectively.
Dispersion: The standard deviation, s = 28.37, corresponds to a coefficient of variation of 49.4%, indicating a moderate degree of dispersion.
Shape of Distribution: The sample skewness (-0.26) indicated that the distribution of critics score was approximately symmetric. The sample kurtosis (-1.17) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 15 reveal a left-skewed distribution that departs from normality.
Outliers: The boxplot in Figure 16, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The first quartile, third quartile, and IQR were 33, 83, and 50, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [0, 158], which confirmed the absence of outliers.
Audience Scores
This variable captured the audience score for each film.
Table 11: Audience score summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NAs | SD | CV (%) | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 1 | 33 | 61 | 57.4 | 83 | 50 | 100 | 0 | 28.37 | 49.4 | -1.17 | -0.26 |
Figure 17: Audience score histogram and QQ Plot
Figure 18: Audience score boxplot
Central Tendency: Per the summary statistics (Table 11), the median and mean audience score were 61 points and 57.4 points, respectively.
Dispersion: The standard deviation, s = 28.37, corresponds to a coefficient of variation of 49.4%, indicating a moderate degree of dispersion.
Shape of Distribution: The sample skewness (-0.26) indicated that the distribution of audience score was approximately symmetric. The sample kurtosis (-1.17) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 17 reveal a left-skewed distribution that departs from normality.
Outliers: The boxplot in Figure 18, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The first quartile, third quartile, and IQR were 33, 83, and 50, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [0, 158], which confirmed the absence of outliers.
Cast Votes
This variable captured the total number of allocated IMDB votes across the cast of each film.
Table 12: Cast votes summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NAs | SD | CV (%) | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 183 | 18552.5 | 78379.2 | 158644.9 | 228174.2 | 209621.7 | 1504872 | 0 | 198107.9 | 124.9 | 6.57 | 2.17 |
Figure 19: Cast votes histogram and QQ Plot
Figure 20: Cast votes boxplot
Central Tendency: Per the summary statistics (Table 12), the median and mean cast votes were 78,379.2 and 158,644.9, respectively.
Dispersion: The standard deviation, s = 198,107.89, corresponds to a coefficient of variation of 124.9%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (2.17) indicated that the distribution of cast votes was right-skewed. The sample kurtosis (6.57) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 19 reveal a right-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 20, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 18,552.5, 228,174.2, and 209,621.7, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [0, 542,606.75], which confirmed the existence of 26 outliers.
Log Cast Votes
This is a log transformation of the cast votes variable.
Table 13: Log cast votes summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NAs | SD | CV (%) | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 7.5 | 14.2 | 16.3 | 15.9 | 17.8 | 3.6 | 20.5 | 0 | 2.46 | 15.5 | -0.08 | -0.67 |
Figure 21: Log cast votes histogram and QQ Plot
Figure 22: Log cast votes boxplot
Central Tendency: Per the summary statistics (Table 13), the median and mean log cast votes were 16.3 log(votes) and 15.9 log(votes), respectively.
Dispersion: The standard deviation, s = 2.46, corresponds to a coefficient of variation of 15.5%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (-0.67) indicated that the distribution of log cast votes was left-skewed. The sample kurtosis (-0.08) indicated that the distribution was slightly platykurtic, or light-tailed. The histogram and QQ plot in Figure 21 reveal a left-skewed distribution that approximates normality.
Outliers: The boxplot in Figure 22, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 14.2, 17.8, and 3.6, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [8.8, 23.2], which confirmed the existence of 4 outliers.
Scores
This variable captured the total score for each film, defined as 10 * IMDB Rating + critics score + audience_score.
Table 14: Scores summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NAs | SD | CV (%) | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 38 | 145 | 187 | 184.5 | 232 | 87 | 284 | 0 | 54.52 | 29.6 | -0.82 | -0.32 |
Figure 23: Scores histogram and QQ Plot
Figure 24: Scores boxplot
Central Tendency: Per the summary statistics (Table 14), the median and mean total score were 187 points and 184.5 points, respectively.
Dispersion: The standard deviation, s = 54.52, corresponds to a coefficient of variation of 29.6%, indicating a moderate degree of dispersion.
Shape of Distribution: The sample skewness (-0.32) indicated that the distribution of total scores was approximately symmetric. The sample kurtosis (-0.82) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 23 reveal a left-skewed distribution that approximates normality.
Outliers: The boxplot in Figure 24, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The first quartile, third quartile, and IQR were 145, 232, and 87, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [14.5, 362.5], which confirmed the absence of outliers.
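The composite score defined at the start of this subsection puts the IMDB rating on the same 0 to 100 scale as the two Rotten Tomatoes scores before summing. A minimal Python sketch follows; the function name and the example inputs are hypothetical and chosen only to show the arithmetic.

```python
def total_score(imdb_rating, critics_score, audience_score):
    """10 * IMDB rating + critics score + audience score."""
    # Multiplying by 10 rescales the IMDB rating (out of 10) to a 100 scale.
    return 10 * imdb_rating + critics_score + audience_score


# Hypothetical film: IMDB rating 7.0, critics score 80, audience score 75.
example_score = total_score(7.0, 80, 75)
```

Because each component tops out at 100 after rescaling, the composite cannot exceed 300, which is consistent with the maximum of 284 in the summary table.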
Log Scores
This is a log transformation of the scores variable.
Table 15: Log scores summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NAs | SD | CV (%) | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 5.2 | 7.2 | 7.5 | 7.5 | 7.9 | 0.7 | 8.1 | 0 | 0.5 | 6.7 | 1.12 | -1.07 |
Figure 25: Log scores histogram and QQ Plot
Figure 26: Log scores boxplot
Central Tendency: Per the summary statistics (Table 15), the median and mean log total score were both 7.5.
Dispersion: The standard deviation, s = 0.5, corresponds to a coefficient of variation of 6.7%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (-1.07) indicated that the distribution of log total scores was left-skewed. The sample kurtosis (1.12) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 25 reveal a left-skewed distribution that departs rather significantly from normality.
Outliers: The boxplot in Figure 26, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 7.2, 7.9, and 0.7, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [6.15, 8.95], which confirmed the existence of 13 outliers.
Runtime
This variable captured the runtime of each film, in minutes.
Table 16: Runtime summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NAs | SD | CV (%) | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 65 | 93 | 103 | 106.6 | 116 | 23 | 267 | 0 | 19.12 | 17.9 | 9.67 | 1.98 |
Figure 27: Runtime histogram and QQ Plot
Figure 28: Runtime boxplot
Central Tendency: Per the summary statistics (Table 16), the median and mean runtime were 103 minutes and 106.6 minutes, respectively.
Dispersion: The standard deviation, s = 19.12, corresponds to a coefficient of variation of 17.9%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (1.98) indicated that the distribution of runtime was right-skewed. The sample kurtosis (9.67) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 27 reveal a right-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 28, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 93, 116, and 23, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [58.5, 150.5], which confirmed the existence of 16 outliers.
Theatre Days
This variable captured the number of days between theatrical and DVD release.
Table 17: Theatre days summary statistics
Figure 29: Theatre days histogram and QQ Plot
Figure 30: Theatre days boxplot
Central Tendency: The summary statistics (Table 17)
Dispersion:
Shape of Distribution: The histogram and QQ plot in Figure 29 reveal a left-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 30, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present.
Box Office
Total lifetime box office revenue was obtained for a random subset of 231 films from the movie data set. This is an analysis of box office revenue for that subset.
Table 18: Box office revenue summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NAs | SD | CV (%) | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231 | 2749 | 1064483 | 13349927 | 38611880 | 49655390 | 48590907 | 658672302 | 0 | 68922521 | 178.5 | 31.2 | 4.55 |
Figure 31: Box office revenue histogram and QQ Plot
Figure 32: Box office revenue boxplot
Central Tendency: Per the summary statistics (Table 18), the median and mean box office revenue were 13,349,927 dollars and 38,611,879.9 dollars, respectively.
Dispersion: The standard deviation, s = 68,922,520.54, corresponds to a coefficient of variation of 178.5%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (4.55) indicated that the distribution of box office revenue was right-skewed. The sample kurtosis (31.2) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 31 reveal a right-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 32, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 1,064,483, 49,655,390, and 48,590,907, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [0, 122,541,750.5], which confirmed the existence of 21 outliers.
Log Box Office
This is a log transformation of the box office variable.
Table 19: Log box office revenue summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NAs | SD | CV (%) | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231 | 11.4 | 20 | 23.7 | 22.6 | 25.6 | 5.5 | 29.3 | 0 | 3.84 | 17 | -0.21 | -0.8 |
Figure 33: Log box office revenue histogram and QQ Plot
Figure 34: Log box office revenue boxplot
Central Tendency: Per the summary statistics (Table 19), the median and mean log box office revenue were 23.7 log(dollars) and 22.6 log(dollars), respectively.
Dispersion: The standard deviation, s = 3.84, corresponds to a coefficient of variation of 17%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (-0.8) indicated that the distribution of log box office revenue was left-skewed. The sample kurtosis (-0.21) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 33 reveal a left-skewed distribution that approximates normality.
Outliers: The boxplot in Figure 34, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 20, 25.6, and 5.5, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [11.75, 33.85], which confirmed the existence of 1 outlier.
Bivariate Analysis
Dependent Variable
As mentioned above, the first objective was to identify an available variable that could serve as a proxy for box office success. Correlation tests were conducted between the quantitative variables and the log of box office revenue. The 10 variables most highly correlated with the log of box office revenue are summarized in Table 20.
Table 20: Variables most highly correlated with log box office revenue

| Variable | Correlation | Statistic | df | p.value | X95..CI |
|---|---|---|---|---|---|
| imdb_num_votes_log | 0.69 | 14.31 | 229 | < 0.05 | [ 0.61 , 0.75 ] |
| cast_votes_log | 0.57 | 10.39 | 229 | < 0.05 | [ 0.47 , 0.65 ] |
| imdb_num_votes | 0.47 | 8.04 | 229 | < 0.05 | [ 0.36 , 0.56 ] |
| cast_votes | 0.46 | 7.82 | 229 | < 0.05 | [ 0.35 , 0.56 ] |
| cast_experience_log | 0.39 | 6.38 | 229 | < 0.05 | [ 0.27 , 0.49 ] |
| cast_experience | 0.37 | 6.01 | 229 | < 0.05 | [ 0.25 , 0.48 ] |
| director_experience_log | 0.27 | 4.20 | 229 | < 0.05 | [ 0.14 , 0.38 ] |
| runtime | 0.26 | 4.12 | 229 | < 0.05 | [ 0.14 , 0.38 ] |
| runtime_log | 0.26 | 4.04 | 229 | < 0.05 | [ 0.13 , 0.37 ] |
| director_experience | 0.25 | 3.89 | 229 | < 0.05 | [ 0.12 , 0.37 ] |
Figure 35: Scatterplots of variables most highly correlated with log of box office revenue
The scatter plots in Figure 35 confirmed the relationship. The log of the number of IMDB votes was therefore used as a proxy for box office revenue, and correlation with this dependent variable was the focus of the bivariate analysis that follows.
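The statistic, degrees of freedom, and confidence interval reported for each correlation are linked by standard formulas. The sketch below assumes a two-sided Pearson test with a Fisher z interval, which is the usual default but is not confirmed by the source. Because it starts from the rounded correlation (0.69) for imdb_num_votes_log, the results only approximate the tabled values.

```python
import math


def pearson_test(r, n, z_crit=1.96):
    """t statistic, df, and Fisher-z 95% CI for a Pearson correlation."""
    df = n - 2
    t_stat = r * math.sqrt(df / (1 - r ** 2))
    z = math.atanh(r)                       # Fisher transformation
    half_width = z_crit / math.sqrt(n - 3)  # standard error is 1/sqrt(n-3)
    ci = (math.tanh(z - half_width), math.tanh(z + half_width))
    return t_stat, df, ci


# n = 231 films with box office data; r is the rounded tabled value.
t_stat, df, (ci_low, ci_high) = pearson_test(r=0.69, n=231)
print(df, round(t_stat, 1), round(ci_low, 2), round(ci_high, 2))
```

With n = 231 the test has 229 degrees of freedom, matching the df column in the table.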
Categorical Variable Analysis
Association tests were conducted between the categorical variables and the log number of IMDB votes and Table 21 summarizes the most highly associated variables.
Table 21: Categorical variables most highly associated with the log number of IMDB votes

| Dependent | Independent | R.squared | F.value | p.value |
|---|---|---|---|---|
| Log IMDB Votes | genre | 0.16 | 11.18 | < 0.05 |
| Log IMDB Votes | mpaa_rating | 0.11 | 18.57 | < 0.05 |
| Log IMDB Votes | best_pic_nom | 0.05 | 33.36 | < 0.05 |
| Log IMDB Votes | thtr_rel_month | 0.03 | 1.92 | < 0.05 |
| Log IMDB Votes | best_pic_win | 0.03 | 21.15 | < 0.05 |
| Log IMDB Votes | best_dir_win | 0.03 | 20.99 | < 0.05 |
| Log IMDB Votes | thtr_rel_season | 0.03 | 4.32 | < 0.05 |
| Log IMDB Votes | best_actor_win | 0.02 | 11.69 | < 0.05 |
| Log IMDB Votes | best_actress_win | 0.01 | 6.02 | < 0.05 |
As expected, director and studio were highly associated with film popularity.
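The R.squared and F.value columns suggest that each association test was a one-way analysis of variance of log IMDB votes on a single factor; that is an assumption, since the test is not named. Under that assumption, both quantities come from the same sums of squares, as the sketch below shows on a small set of hypothetical values.

```python
def one_way_anova(groups):
    """Return (R-squared, F) for a numeric response split by one factor."""
    values = [y for g in groups.values() for y in g]
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_total = sum((y - grand_mean) ** 2 for y in values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_within = ss_total - ss_between
    r_squared = ss_between / ss_total  # share of variance explained
    f_value = (ss_between / (k - 1)) / (ss_within / (n - k))
    return r_squared, f_value


# Hypothetical log vote counts by MPAA rating (illustrative values only).
log_votes = {
    "PG": [13.0, 14.0, 15.0],
    "PG-13": [15.0, 16.0, 17.0],
    "R": [12.0, 13.0, 14.0],
}
r2, f_value = one_way_anova(log_votes)
print(round(r2, 2), round(f_value, 2))  # 0.7 7.0
```

This also shows why a factor with many levels can have a larger R-squared but a smaller F than a two-level factor: F divides the explained variation by the number of levels minus one.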
Quantitative Variable Analysis
The correlations between the quantitative variables and the log number of IMDB votes are listed in Table 22 in order of highest to lowest correlation.
Table 22: Top quantitative variables most highly correlated with the log number of IMDB votes

The log of cast votes was highly correlated with the log number of IMDB votes, whereas cast experience was considerably less correlated. Surprisingly, the number of days between the theatrical release and the DVD release was negatively correlated with IMDB votes, which raises the question of what proportion of IMDB votes is cast by DVD viewers.
Part 4: Modeling
Developing a linear model that can be used to effectively predict box office performance is the focus of this section. As such, full model feature selection, model selection, model diagnostics, and model interpretation are covered here in detail.
Full Model Feature Selection
In the prior section, association and correlation tests were conducted at a 95% confidence level. All categorical variables remained in the model. The following quantitative variables were removed because each was redundant with another variable that had a higher correlation with the dependent variable:
* cast_experience - redundant with the log of cast experience
* cast_votes - redundant with cast_votes_log, which has a higher correlation
* director_experience - redundant with the log transformation of director experience
* imdb_num_votes - redundant with the dependent variable, its log transformation
* imdb_rating - redundant aspect of movie performance
* runtime - redundant with the runtime_log variable
* scores_log - redundant with the scores variable
* thtr_days - redundant with the log transformation of the thtr_days variable
* scores - redundant with IMDB rating
* audience_score - redundant with IMDB rating
The resulting full model is presented in Table 23.
Table 23: Full Model

| Variable | Type | Measure | Value | p.value |
|---|---|---|---|---|
| genre | Categorical | R-squared | 0.16 | < 0.05 |
| mpaa_rating | Categorical | R-squared | 0.11 | < 0.05 |
| best_pic_nom | Categorical | R-squared | 0.05 | < 0.05 |
| thtr_rel_month | Categorical | R-squared | 0.03 | < 0.05 |
| best_pic_win | Categorical | R-squared | 0.03 | < 0.05 |
| best_dir_win | Categorical | R-squared | 0.03 | < 0.05 |
| thtr_rel_season | Categorical | R-squared | 0.03 | < 0.05 |
| best_actor_win | Categorical | R-squared | 0.02 | < 0.05 |
| best_actress_win | Categorical | R-squared | 0.01 | < 0.05 |
Model Selection
Starting with the full model described above, both forward selection and backward elimination methods were employed to determine if two different models emerged from the process, then to evaluate the performance of those models.
Forward Selection
The forward selection process begins with a null model. Each candidate variable is added to the current model one at a time, and the candidate that provides the greatest improvement in adjusted R-squared is identified. If the adjusted R-squared with that candidate exceeds the best adjusted R-squared obtained so far, the candidate is added to the model. The process repeats until all predictors have been evaluated. Table 24 summarizes the forward selection process and the summary statistics for each step.
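The procedure described above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the code used for the report: the helper names are hypothetical, the data are made up, the fit is ordinary least squares via the normal equations, and categorical predictors such as genre would first need to be dummy-coded.

```python
def ols_r2(columns, y):
    """R-squared of an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y stored as an augmented matrix.
    A = [
        [sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
        + [sum(X[i][r] * y[i] for i in range(n))]
        for r in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def forward_select(predictors, y):
    """Greedy forward selection that maximizes adjusted R-squared."""
    selected, best = [], 0.0  # null (intercept-only) model scores 0
    remaining = list(predictors)
    while remaining:
        scores = {
            name: adjusted_r2(
                ols_r2([predictors[s] for s in selected + [name]], y),
                len(y),
                len(selected) + 1,
            )
            for name in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best:
            break  # no remaining variable improves the model
        selected.append(candidate)
        best = scores[candidate]
        remaining.remove(candidate)
    return selected, best


# Hypothetical data: y depends strongly on x1, weakly on x2, not on z.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
z = [5.0, 3.0, 4.0, 6.0, 3.0, 5.0, 4.0, 6.0]
e = [0.1, -0.1, 0.1, -0.1, -0.1, 0.1, -0.1, 0.1]
y = [3 * a + 2 * b + c for a, b, c in zip(x1, x2, e)]

selected, best = forward_select({"x1": x1, "x2": x2, "z": z}, y)
print(selected, round(best, 3))
```

The strongest predictor enters first, which mirrors cast_votes_log entering at step 1 in the report's selection table.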
Table 24: Forward Selection Prediction Model

| Step | Selected | Model.Size | DF | F.statistic | R.Squared | Adjusted.R2 | p.value | Pct Chg |
|---|---|---|---|---|---|---|---|---|
| 1 | cast_votes_log | 1 | 610 | 808.90 | 0.57 | 0.57 | 0 | 0.00 |
| 2 | cast_experience_log | 2 | 609 | 487.70 | 0.62 | 0.61 | 0 | 7.91 |
| 3 | thtr_days_log | 3 | 608 | 348.09 | 0.63 | 0.63 | 0 | 2.61 |
| 4 | genre | 4 | 598 | 85.67 | 0.65 | 0.64 | 0 | 2.06 |
| 5 | critics_score | 5 | 597 | 91.62 | 0.68 | 0.68 | 0 | 4.98 |
| 6 | best_pic_nom | 6 | 596 | 89.02 | 0.69 | 0.68 | 0 | 1.33 |
| 7 | director_experience_log | 7 | 595 | 85.09 | 0.70 | 0.69 | 0 | 0.58 |
| 8 | thtr_rel_season | 8 | 591 | 69.11 | 0.70 | 0.69 | 0 | 0.29 |
| 9 | runtime_log | 9 | 590 | 66.58 | 0.70 | 0.69 | 0 | 0.43 |
| 10 | best_pic_win | 10 | 589 | 63.90 | 0.70 | 0.69 | 0 | 0.14 |
Table 24 Forward Selection Process
Model Overview
The model is summarized as follows:
\[y_i\]
Table 25: Backward Elimination Prediction Model
Part 5: Prediction
Part 6: Conclusion
Appendix
Appendix A: Codebook
Table 26: Movie data set codebook

| Source | Type | Variable | Description |
|---|---|---|---|
| General | |||
| IMDB/RT/BO | Categorical | actor1 | First main actor/actress in the abridged cast of the movie |
| IMDB/RT/BO | Categorical | actor2 | Second main actor/actress in the abridged cast of the movie |
| IMDB/RT/BO | Categorical | actor3 | Third main actor/actress in the abridged cast of the movie |
| IMDB/RT/BO | Categorical | actor4 | Fourth main actor/actress in the abridged cast of the movie |
| IMDB/RT/BO | Categorical | actor5 | Fifth main actor/actress in the abridged cast of the movie |
| IMDB/RT/BO | Categorical | audience_rating | Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright) |
| IMDB/RT/BO | Categorical | critics_rating | Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten) |
| Organization | |||
| IMDB/RT/BO | Categorical | dvd_rel_day | Day of the month the movie is released on DVD |
| IMDB/RT/BO | Categorical | dvd_rel_month | Month the movie is released on DVD |
| IMDB/RT/BO | Categorical | dvd_rel_year | Year the movie is released on DVD |
| IMDB/RT/BO | Categorical | imdb_url | Link to IMDB page for the movie |
| IMDB/RT/BO | Categorical | rt_url | Link to Rotten Tomatoes page for the movie |
| IMDB/RT/BO | Categorical | thtr_rel_day | Day of the month the movie is released in theaters |
| IMDB/RT/BO | Categorical | thtr_rel_year | Year the movie is released in theaters |
| Dates | |||
| IMDB/RT/BO | Categorical | title | Title of movie |
| IMDB/RT/BO | Categorical | best_actor_win | Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie |
| IMDB/RT/BO | Categorical | best_actress_win | Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actress won an Oscar for her role in the given movie |
| IMDB/RT/BO | Categorical | best_dir_win | Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie |
| IMDB/RT/BO | Categorical | best_pic_nom | Whether or not the movie was nominated for a best picture Oscar (no, yes) |
| IMDB/RT/BO | Categorical | best_pic_win | Whether or not the movie won a best picture Oscar (no, yes) |
| IMDB/RT/BO | Categorical | director | Director of the movie |
| Experience | |||
| IMDB/RT/BO | Categorical | genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other) |
| IMDB/RT/BO | Categorical | mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, Unrated) |
| Performance | |||
| IMDB/RT/BO | Categorical | studio | The studio that produced the film |
| IMDB/RT/BO | Categorical | thtr_rel_month | Month the movie is released in theaters |
| Derived | Categorical | thtr_rel_season | Season the movie was released in theaters |
| IMDB/RT/BO | Categorical | title_type | Type of movie (Documentary, Feature Film, TV Movie) |
| IMDB/RT/BO | Categorical | top200_box | Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes) |
| IMDB/RT/BO | Numeric | audience_score | Audience score on Rotten Tomatoes |
| Derived | Numeric | box_office | Box office revenue from BoxOfficeMojo.com |
| Derived | Numeric | box_office_log | Log of Box office revenue from BoxOfficeMojo.com |
| Derived | Numeric | cast_experience | The sum across all cast members for a film, of the number of films in which each actor appeared |
| Derived | Numeric | cast_experience_log | Log of the sum across all cast members for a film, of the number of films in which each actor appeared |
| Derived | Numeric | cast_votes | Total number of allocated IMDB votes for the cast of a film |
| Derived | Numeric | cast_votes_log | Log of the total number of allocated IMDB votes for the cast of a film |
| IMDB/RT/BO | Numeric | critics_score | Critics score on Rotten Tomatoes |
| Derived | Numeric | director_experience | Total number of films in sample for a director |
| Derived | Numeric | director_experience_log | Log total number of films in sample for a director |
| Interaction | |||
| IMDB/RT/BO | Numeric | imdb_num_votes | Number of votes on IMDB |
| Derived | Numeric | imdb_num_votes_log | Log number of IMDB votes |
| Box Office | |||
| IMDB/RT/BO | Numeric | imdb_rating | Rating on IMDB |
| IMDB/RT/BO | Numeric | runtime | Runtime of movie (in minutes) |
| Derived | Numeric | runtime_log | Log runtime of movie (in minutes) |
| Derived | Numeric | scores | 10 * IMDB Rating + critics score + audience_score |
| Derived | Numeric | scores_log | Log(10 * IMDB Rating + critics score + audience_score) |
| Derived | Numeric | thtr_days | Number of days between theatre and dvd release |
| Derived | Numeric | thtr_days_log | Log number of days between theatre and dvd release |
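Several of the derived variables in the codebook are simple functions of the source columns. The sketch below illustrates three of them; the function names are hypothetical, the example film is made up, and the month-to-season mapping is an assumption because the codebook does not define the seasons.

```python
from datetime import date


def combined_scores(imdb_rating, critics_score, audience_score):
    """scores = 10 * IMDB rating + critics score + audience score."""
    return 10 * imdb_rating + critics_score + audience_score


def theater_days(thtr_release, dvd_release):
    """thtr_days: days between the theatrical and DVD release dates."""
    return (dvd_release - thtr_release).days


def release_season(month):
    """thtr_rel_season from a month number (assumed season boundaries)."""
    seasons = {
        "winter": (12, 1, 2),
        "spring": (3, 4, 5),
        "summer": (6, 7, 8),
        "fall": (9, 10, 11),
    }
    for season, months in seasons.items():
        if month in months:
            return season
    raise ValueError(f"invalid month: {month}")


# Hypothetical film, for illustration only.
print(combined_scores(6.5, 61, 57))                        # 183.0
print(theater_days(date(2010, 1, 15), date(2010, 5, 15)))  # 120
print(release_season(12))                                  # winter
```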
Appendix B: Univariate Analysis
Best Director / Actor / Actress
As indicated in Figure 36, the percentages of films with best director, actor, and actress Oscars were 7%, 15%, and 12%, respectively. Again, the decision in this case was to assume that, with random sampling, these ratios reflected the true population proportions, and to keep the variables for further analysis during the modeling stage.
Figure 36: Best director/actor/actress
Best Picture
Since the proportions of films nominated for and winning best picture were so small, these variables were not likely to be good predictors of movie popularity. The bivariate analysis illuminates this further.
Figure 37: Best picture nominations and wins
Genre
The drama genre represented a plurality of the releases in the sample, followed by comedy, action & adventure, and then mystery & suspense. The top four genres account for nearly 80% of the films in the sample.
Figure 38: Films by genre
MPAA Rating
Rated R films accounted for over 0% of the releases, followed by PG and PG-13. Collectively, R, PG, and PG-13 rated films represent 90% of the films in the sample. NC-17 films were excluded from this analysis.
Figure 39: Films by MPAA Rating
Month of Theatrical Release
Though the plurality of features in the sample (32%) were released during the months of January, June, October and December, the distribution of theatrical release months appeared fairly balanced within the sample.
Figure 40: Theatrical releases by month
Season of Theatrical Release
The plurality of features in the sample were released during the fall and summer months with over 20% opening in the month of December alone.
Figure 41: Theatrical releases by season
Audience Scores
This variable captured the audience scores from Rotten Tomatoes for each film.
Table 27: Audience score summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 1 | 33 | 61 | 57.4 | 83 | 50 | 100 | 0 | 28.37 | 49.4 | -1.17 | -0.26 |
Figure 42: Audience score histogram and QQ Plot
Figure 43: Audience score boxplot
Central Tendency: Per the summary statistics (Table 27), the median and mean audience scores were 61 points and 57.4 points, respectively.
Dispersion: The standard deviation, s = 28.37, corresponds with a coefficient of variation of 49.4%, indicating a moderate degree of dispersion.
Shape of Distribution: The sample skewness (-0.26) indicated that the distribution of audience score was approximately symmetric. The sample kurtosis (-1.17) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 42 reveal a slightly left-skewed distribution that departs from normality.
Outliers: The boxplot in Figure 43, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The 25th percentile, 75th percentile, and IQR were 33, 83, and 50, respectively. This yielded a 1.5xIQR ‘acceptable’ range of [0, 158] and confirmed the absence of outliers.
Box Office
Total lifetime box office revenue was obtained for a subset of 231 randomly selected films from the movie data set. This is an analysis of box office revenue for this random sampling.
Table 28: Box office revenue summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231 | 2749 | 1064483 | 13349927 | 38611880 | 49655390 | 48590907 | 658672302 | 0 | 68922521 | 178.5 | 31.2 | 4.55 |
Figure 44: Box office revenue histogram and QQ Plot
Figure 45: Box office revenue boxplot
Central Tendency: Per the summary statistics (Table 28), the median and mean box office were 13,349,927 dollars and 38,611,879.9 dollars, respectively.
Dispersion: The standard deviation, s = 68,922,520.54, corresponds with a coefficient of variation of 178.5%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (4.55) indicated that the distribution of box office was right-skewed. The sample kurtosis (31.2) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 44 reveal a right-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 45, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that outliers were extant. The 25%, 75%, and IQR were 1,064,483, 49,655,390, and 48,590,907, respectively. This yielded a 1.5xIQR ‘acceptable’ range [0, 122,541,750.5]. Indeed, this confirmed the existence of 21 outliers.
Log Box Office
This is a log transformation of the box office variable.
Table 29: Log box office revenue summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231 | 11.4 | 20 | 23.7 | 22.6 | 25.6 | 5.5 | 29.3 | 0 | 3.84 | 17 | -0.21 | -0.8 |
Figure 46: Log box office revenue histogram and QQ Plot
Figure 47: Log box office revenue boxplot
Central Tendency: Per the summary statistics (Table 29), the median and mean log box office were 23.7 log(dollars) and 22.6 log(dollars), respectively.
Dispersion: The standard deviation, s = 3.84, corresponds with a coefficient of variation of 17%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (-0.8) indicated that the distribution of log box office was left-skewed. The sample kurtosis (-0.21) indicated that the distribution was slightly platykurtic, or light-tailed. The histogram and QQ plot in Figure 46 reveal a left-skewed distribution that approximates normality.
Outliers: The boxplot in Figure 47, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The 25th percentile, 75th percentile, and IQR were 20, 25.6, and 5.5, respectively. This yielded a 1.5xIQR ‘acceptable’ range of [11.75, 33.85] and confirmed the existence of 1 outlier.
Cast Experience
This derived variable measured the relative experience of a given cast and was defined as the sum of the observations for the cast associated with each film.
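One reading of this definition is: count how many sample films each actor appears in, then sum those counts over the cast listed for each film. The sketch below follows that reading with hypothetical titles and actor names; the function name is illustrative.

```python
from collections import Counter


def cast_experience(films):
    """Per film, sum the number of sample films each listed actor appears in."""
    appearances = Counter(actor for cast in films.values() for actor in cast)
    return {
        title: sum(appearances[actor] for actor in cast)
        for title, cast in films.items()
    }


films = {  # hypothetical titles and actors
    "Film A": ["Actor 1", "Actor 2"],
    "Film B": ["Actor 1", "Actor 3"],
    "Film C": ["Actor 4", "Actor 5"],
}
experience = cast_experience(films)
print(experience)  # {'Film A': 3, 'Film B': 3, 'Film C': 2}
```

Under this reading, a film whose five listed actors each appear only once in the sample scores 5, which is consistent with the minimum reported for cast experience.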
Table 30: Cast experience summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 5 | 6 | 7 | 7.5 | 9 | 3 | 15 | 0 | 2.16 | 29 | 0.32 | 0.89 |
Figure 48: Cast experience histogram and QQ Plot
Figure 49: Cast experience boxplot
Central Tendency: Per the summary statistics (Table 30), the median and mean cast experience were 7 films and 7.5 films, respectively.
Dispersion: The standard deviation, s = 2.16, corresponds with a coefficient of variation of 29%, indicating a moderate degree of dispersion.
Shape of Distribution: The sample skewness (0.89), indicated that the distribution of cast experience was right-skewed. The sample kurtosis (0.32), indicated that the distribution of cast experience was leptokurtic or heavy-tailed. The histogram and QQ plot in Figure 48 reveal a distribution which departs significantly from normality.
Outliers: The boxplot in Figure 49, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that outliers were extant. The 25%, 75%, and IQR were 6, 9, and 3, respectively. This yielded a 1.5xIQR ‘acceptable’ range [1.5, 13.5]. Indeed, this confirmed the existence of 7 outliers. Given the proximity of the outliers to the 1.5xIQR, no effort was made to remove them.
Log Cast Experience
This variable is a log transformation of the cast experience variable.
Table 31: Log cast experience summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 2.3 | 2.6 | 2.8 | 2.8 | 3.2 | 0.6 | 3.9 | 0 | 0.4 | 14 | -0.72 | 0.36 |
Figure 50: Log cast experience histogram and QQ Plot
Figure 51: Log cast experience boxplot
Central Tendency: Per the summary statistics (Table 31), the median and mean log cast experience were both 2.8 log(films).
Dispersion: The standard deviation, s = 0.4, corresponds with a coefficient of variation of 14%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (0.36) indicated that the distribution of log cast experience was approximately symmetric. The sample kurtosis (-0.72) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 50 reveal a distribution that departs significantly from normality.
Outliers: The boxplot in Figure 51, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The 25th percentile, 75th percentile, and IQR were 2.6, 3.2, and 0.6, respectively. This yielded a 1.5xIQR ‘acceptable’ range of [1.7, 4.1] and confirmed the absence of outliers.
Cast Votes
This variable captured the total number of votes allocated to each cast member for a film.
Table 32: Cast votes summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 183 | 18552.5 | 78379.2 | 158644.9 | 228174.2 | 209621.7 | 1504872 | 0 | 198107.9 | 124.9 | 6.57 | 2.17 |
Figure 52: Cast votes histogram and QQ Plot
Figure 53: Cast votes boxplot
Central Tendency: Per the summary statistics (Table 32), the median and mean cast votes were 78,379.2 votes and 158,644.9 votes, respectively.
Dispersion: The standard deviation, s = 198,107.89, corresponds with a coefficient of variation of 124.9%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (2.17) indicated that the distribution of cast votes was right-skewed. The sample kurtosis (6.57) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 52 reveal a right-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 53, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that outliers were extant. The 25%, 75%, and IQR were 18,552.5, 228,174.2, and 209,621.7, respectively. This yielded a 1.5xIQR ‘acceptable’ range [0, 542,606.75]. Indeed, this confirmed the existence of 26 outliers.
Log Cast Votes
This is a log transformation of the cast votes variable.
Table 33: Log cast votes summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 7.5 | 14.2 | 16.3 | 15.9 | 17.8 | 3.6 | 20.5 | 0 | 2.46 | 15.5 | -0.08 | -0.67 |
Figure 54: Log cast votes histogram and QQ Plot
Figure 55: Log cast votes boxplot
Central Tendency: Per the summary statistics (Table 33), the median and mean log cast votes were 16.3 log(votes) and 15.9 log(votes), respectively.
Dispersion: The standard deviation, s = 2.46, corresponds with a coefficient of variation of 15.5%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (-0.67) indicated that the distribution of log cast votes was left-skewed. The sample kurtosis (-0.08) indicated that the distribution was slightly platykurtic, or light-tailed. The histogram and QQ plot in Figure 54 reveal a left-skewed distribution that approximates normality.
Outliers: The boxplot in Figure 55, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that outliers were extant. The 25%, 75%, and IQR were 14.2, 17.8, and 3.6, respectively. This yielded a 1.5xIQR ‘acceptable’ range [8.8, 23.2]. Indeed, this confirmed the existence of 4 outliers.
Critics Scores
This variable captured the critics scores for each film.
Table 34: Critics score summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 1 | 33 | 61 | 57.4 | 83 | 50 | 100 | 0 | 28.37 | 49.4 | -1.17 | -0.26 |
Figure 56: Critics score histogram and QQ Plot
Figure 57: Critics score boxplot
Central Tendency: Per the summary statistics (Table 34), the median and mean critics scores were 61 points and 57.4 points, respectively.
Dispersion: The standard deviation, s = 28.37, corresponds with a coefficient of variation of 49.4%, indicating a moderate degree of dispersion.
Shape of Distribution: The sample skewness (-0.26) indicated that the distribution of critics score was approximately symmetric. The sample kurtosis (-1.17) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 56 reveal a slightly left-skewed distribution that departs from normality.
Outliers: The boxplot in Figure 57, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The 25th percentile, 75th percentile, and IQR were 33, 83, and 50, respectively. This yielded a 1.5xIQR ‘acceptable’ range of [0, 158] and confirmed the absence of outliers.
Director Experience
This derived variable measured the relative experience of a given director and was defined as the sum of the observations for the director associated with each film.
Table 35: Director experience summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 1 | 1 | 1 | 1.5 | 2 | 1 | 4 | 0 | 0.75 | 51.4 | 2.55 | 1.73 |
Figure 58: Director experience histogram and QQ Plot
Figure 59: Director experience boxplot
Central Tendency: Per the summary statistics (Table 35), the median and mean director experience were 1 film and 1.5 films, respectively.
Dispersion: The standard deviation, s = 0.75, corresponds with a coefficient of variation of 51.4%, indicating a high degree of dispersion.
Shape of Distribution: The sample skewness (1.73), indicated that the distribution of director experience was right-skewed. The sample kurtosis (2.55), indicated that the distribution of director experience was leptokurtic or heavy-tailed. The histogram and QQ plot in Figure 58 reveal a distribution which departs significantly from normality.
Outliers: The boxplot in Figure 59, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that outliers were extant. The 25%, 75%, and IQR were 1, 2, and 1, respectively. This yielded a 1.5xIQR ‘acceptable’ range [0, 3.5]. Indeed, this confirmed the existence of 20 outliers. Given the proximity of the outliers to the 1.5xIQR, no effort was made to remove them.
Log Director Experience
This variable was a log transformation of the director experience variable.
Table 36: Log director experience summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 0 | 0 | 0 | 0.4 | 1 | 1 | 2 | 0 | 0.6 | 149.8 | -0.09 | 1.11 |
Figure 60: Log director experience histogram and QQ Plot
Figure 61: Log director experience boxplot
Central Tendency: Per the summary statistics (Table 36), the median and mean log director experience were 0 log(films) and 0.4 log(films), respectively.
Dispersion: The standard deviation, s = 0.6, corresponds with a coefficient of variation of 149.8%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (1.11) indicated that the distribution of log director experience was right-skewed. The sample kurtosis (-0.09) indicated that the distribution was slightly platykurtic, or light-tailed. The histogram and QQ plot in Figure 60 reveal a distribution that departs significantly from normality.
Outliers: The boxplot in Figure 61, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The 25th percentile, 75th percentile, and IQR were 0, 1, and 1, respectively. This yielded a 1.5xIQR ‘acceptable’ range of [0, 2.5] and confirmed the absence of outliers.
Number of IMDB Votes
This variable captured the number of IMDB votes cast for each film.
Table 37: IMDB votes summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 183 | 5035 | 16511 | 60193.3 | 62773 | 57738 | 893008 | 0 | 114459.8 | 190.2 | 19.12 | 3.96 |
Figure 62: IMDB votes histogram and QQ Plot
Figure 63: IMDB votes boxplot
Central Tendency: Per the summary statistics (Table 37), the median and mean IMDB votes were 16,511 votes and 60,193.3 votes, respectively.
Dispersion: The standard deviation, s = 114,459.75, corresponds with a coefficient of variation of 190.2%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (3.96), indicated that the distribution of imdb votes was right-skewed. The sample kurtosis (19.12), indicated that the distribution of imdb votes was leptokurtic or heavy-tailed. The histogram and QQ plot in Figure 62 reveal a distribution which departs significantly from normality.
Outliers: The boxplot in Figure 63, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that outliers were extant. The 25%, 75%, and IQR were 5,035, 62,773, and 57,738, respectively. This yielded a 1.5xIQR ‘acceptable’ range [0, 149,380]. Indeed, this confirmed the existence of 67 outliers.
Log Number of IMDB Votes
This was a log transformation of the IMDB votes variable.
Table 38: Log IMDB votes summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 7.5 | 12.3 | 14 | 14.1 | 15.9 | 3.6 | 19.8 | 0 | 2.35 | 16.6 | -0.56 | 0.03 |
Figure 64: Log IMDB votes histogram and QQ Plot
Figure 65: Log IMDB votes boxplot
Central Tendency: Per the summary statistics (Table 38), the median and mean log IMDB votes were 14 log(votes) and 14.1 log(votes), respectively.
Dispersion: The standard deviation, s = 2.35, corresponds with a coefficient of variation of 16.6%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (0.03), indicated that the distribution of imdb log votes was approximately symmetric. The sample kurtosis (-0.56), indicated that the distribution of imdb log votes was platykurtic or light-tailed. The histogram and QQ plot in Figure 64 reveal a nearly normal distribution.
Outliers: The boxplot in Figure 65, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The 25th percentile, 75th percentile, and IQR were 12.3, 15.9, and 3.6, respectively. This yielded a 1.5xIQR ‘acceptable’ range of [6.9, 21.3] and confirmed the absence of outliers.
IMDB Ratings
This variable captured the IMDB rating for each film.
Table 39: IMDB rating summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 1.9 | 5.9 | 6.6 | 6.5 | 7.3 | 1.4 | 9 | 0 | 1.08 | 16.6 | 1.31 | -0.89 |
Figure 66: IMDB rating histogram and QQ Plot
Figure 67: IMDB rating boxplot
Central Tendency: Per the summary statistics (Table 39), the median and mean of the IMDB rating were 6.6 and 6.5 points, respectively.
Dispersion: The standard deviation, s = 1.08, corresponds with a coefficient of variation of 16.6%, indicating a low degree of dispersion.
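The coefficient of variation reported in the dispersion paragraphs is the standard deviation expressed as a percentage of the mean. A minimal Python sketch follows, assuming this standard definition; `coefficient_of_variation` is a hypothetical helper name, and the inputs are the rounded values from Table 39.

```python
def coefficient_of_variation(sd: float, mean: float) -> float:
    """Return the standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * sd / abs(mean)


# Rounded SD and mean of the IMDB rating from Table 39.
cv_rating = coefficient_of_variation(sd=1.08, mean=6.5)
print(round(cv_rating, 1))
```

Because the tables report rounded statistics, recomputing the CV from them can differ from the tabulated value in the last digit for some variables.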
Shape of Distribution: The sample skewness (-0.89) indicated that the distribution of IMDB rating was left-skewed. The sample kurtosis (1.31) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 66 reveal a nearly normal distribution.
Outliers: The boxplot in Figure 67, which depicts the median, the IQR, and the minimum and maximum values, suggested the presence of outliers. The first quartile, third quartile, and IQR were 5.9, 7.3, and 1.4, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [3.8, 9.4]. A total of 18 observations fell outside this range.
Runtime
This is an analysis of movie runtimes.
Table 40: Runtime summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 65 | 93 | 103 | 106.6 | 116 | 23 | 267 | 0 | 19.12 | 17.9 | 9.67 | 1.98 |
Figure 68: Runtime histogram and QQ Plot
Figure 69: Runtime boxplot
Central Tendency: Per the summary statistics (Table 40), the median and mean runtime were 103 and 106.6 minutes, respectively.
Dispersion: The standard deviation, s = 19.12, corresponds with a coefficient of variation of 17.9%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (1.98) indicated that the distribution of runtime was right-skewed. The sample kurtosis (9.67) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 68 reveal a right-skewed distribution that departs significantly from normality.
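The shape statistics can be illustrated with a small moment-based calculation. This is a sketch only: the report does not state which estimator was used, so the population-moment formulas below are an assumption, as are the function name `sample_shape` and the example runtimes. Negative kurtosis values are labelled platykurtic throughout this section, which suggests that excess kurtosis (kurtosis minus 3) is reported, and the sketch follows that convention.

```python
def sample_shape(values: list[float]) -> tuple[float, float]:
    """Return (skewness, excess kurtosis) from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    if m2 == 0:
        raise ValueError("shape statistics are undefined for constant data")
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical runtimes with one very long film (not taken from the data set).
skewness, excess_kurtosis = sample_shape([90, 95, 100, 105, 110, 250])
print(skewness > 0)  # a long right tail gives positive skewness
```

A positive skewness corresponds to a long right tail, as with the few very long films in the runtime data.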
Outliers: The boxplot in Figure 69, which depicts the median, the IQR, and the minimum and maximum values, suggested the presence of outliers. The first quartile, third quartile, and IQR were 93, 116, and 23, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [58.5, 150.5]. A total of 16 observations fell outside this range.
Log Runtime
This is an analysis of the log of movie runtimes.
Table 41: Log runtime summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 6 | 6.5 | 6.7 | 6.7 | 6.9 | 0.3 | 8.1 | 0 | 0.24 | 3.5 | 2.06 | 0.86 |
Figure 70: Log runtime histogram and QQ Plot
Figure 71: Log runtime boxplot
Central Tendency: Per the summary statistics (Table 41), the median and mean of log runtime were both 6.7 log minutes.
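The base of the log transformation is not stated. The short Python check below compares the runtime statistics in Table 40 with the log runtime statistics in Table 41 and suggests that a base-2 logarithm is consistent with the reported values; this is an inference from the tables, not a documented choice.

```python
import math

# Minimum, median, and maximum runtime in minutes (Table 40).
runtime_minutes = {"min": 65, "median": 103, "max": 267}
# Corresponding log runtime values (Table 41).
reported_log = {"min": 6.0, "median": 6.7, "max": 8.1}

for name, minutes in runtime_minutes.items():
    base2 = round(math.log2(minutes), 1)
    natural = round(math.log(minutes), 1)
    print(name, base2, natural, reported_log[name])
```

The minimum, median, and maximum are order statistics, so they carry over exactly under a monotone transformation, which is why they can be compared directly across the two tables.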
Dispersion: The standard deviation, s = 0.24, corresponds with a coefficient of variation of 3.5%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (0.86) indicated that the distribution of log runtime was right-skewed. The sample kurtosis (2.06) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 70 reveal a right-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 71, which depicts the median, the IQR, and the minimum and maximum values, suggested the presence of outliers. The first quartile, third quartile, and IQR were 6.5, 6.9, and 0.3, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [6.05, 7.35]. A total of 9 observations fell outside this range.
Scores
This variable captured the total score for each film, defined as 10 × IMDB rating + critics score + audience score.
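The definition of the total score can be written as a one-line function. This is a minimal sketch, assuming the IMDB rating is on a 10-point scale and the critics and audience scores are on 100-point scales; the function name `total_score` and the example film are hypothetical.

```python
def total_score(imdb_rating: float, critics_score: float, audience_score: float) -> float:
    """Combine the three ratings on a common 0-300 scale."""
    return 10 * imdb_rating + critics_score + audience_score


# Hypothetical film: IMDB rating 6.6, critics score 61, audience score 65.
example_score = total_score(imdb_rating=6.6, critics_score=61, audience_score=65)
print(round(example_score, 1))
```

Multiplying the IMDB rating by 10 puts it on the same scale as the other two scores, so each source contributes equally to the total.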
Table 42: Scores summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 38 | 145 | 187 | 184.5 | 232 | 87 | 284 | 0 | 54.52 | 29.6 | -0.82 | -0.32 |
Figure 72: Scores histogram and QQ Plot
Figure 73: Scores boxplot
Central Tendency: Per the summary statistics (Table 42), the median and mean of the total scores were 187 and 184.5 points, respectively.
Dispersion: The standard deviation, s = 54.52, corresponds with a coefficient of variation of 29.6%, indicating a moderate degree of dispersion.
Shape of Distribution: The sample skewness (-0.32) indicated that the distribution of total scores was approximately symmetric. The sample kurtosis (-0.82) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 72 reveal a slightly left-skewed distribution that approximates normality.
Outliers: The boxplot in Figure 73, which depicts the median, the IQR, and the minimum and maximum values, suggested that there were no outliers. The first quartile, third quartile, and IQR were 145, 232, and 87, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [14.5, 362.5]. All observations fell within this range, confirming the absence of outliers.
Log Scores
This is a log transformation of the scores variable.
Table 43: Log scores summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 5.2 | 7.2 | 7.5 | 7.5 | 7.9 | 0.7 | 8.1 | 0 | 0.5 | 6.7 | 1.12 | -1.07 |
Figure 74: Log scores histogram and QQ Plot
Figure 75: Log scores boxplot
Central Tendency: Per the summary statistics (Table 43), the median and mean of the log total scores were both 7.5 log points.
Dispersion: The standard deviation, s = 0.5, corresponds with a coefficient of variation of 6.7%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (-1.07) indicated that the distribution of log total scores was left-skewed. The sample kurtosis (1.12) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 74 reveal a left-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 75, which depicts the median, the IQR, and the minimum and maximum values, suggested the presence of outliers. The first quartile, third quartile, and IQR were 7.2, 7.9, and 0.7, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [6.15, 8.95]. A total of 13 observations fell outside this range.
Theatre Days
This variable captured the number of days between theatrical and DVD release.
Table 44: Theatre days summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 1 | 132 | 391 | 2304.5 | 3910 | 3778 | 12309 | 0 | 3118.24 | 135.3 | 0.62 | 1.34 |
Figure 76: Theatre days histogram and QQ Plot
Figure 77: Theatre days boxplot
Central Tendency: Per the summary statistics (Table 44), the median and mean of days in theatre were 391 and 2,304.5 days, respectively.
Dispersion: The standard deviation, s = 3,118.24, corresponds with a coefficient of variation of 135.3%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (1.34) indicated that the distribution of days in theatre was right-skewed. The sample kurtosis (0.62) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 76 reveal a right-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 77, which depicts the median, the IQR, and the minimum and maximum values, suggested the presence of outliers. The first quartile, third quartile, and IQR were 132, 3,910, and 3,778, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [0, 9,577], with the lower bound floored at zero. A total of 28 observations fell outside this range.
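For theatre days, the computed lower bound (132 − 1.5 × 3,778 = −5,535) is negative, so the reported range starts at zero. The Python sketch below shows one way to floor the lower bound and count the observations outside the range. The function name `count_outliers` and the `floor` parameter are hypothetical, and the sample values are illustrative only (9,600 is invented; the others are taken from Table 44).

```python
def count_outliers(values, q1, q3, k=1.5, floor=0.0):
    """Return (lower, upper, count) using fences with a floored lower bound."""
    iqr = q3 - q1
    lower = max(floor, q1 - k * iqr)  # day counts cannot be negative
    upper = q3 + k * iqr
    outside = [v for v in values if v < lower or v > upper]
    return lower, upper, len(outside)


# Illustrative sample: Table 44 min, Q1, median, Q3, and max, plus one invented value.
sample_days = [1, 132, 391, 3910, 9600, 12309]
lower_days, upper_days, n_outliers = count_outliers(sample_days, q1=132, q3=3910)
print(lower_days, upper_days, n_outliers)
```

Applied to the full data set rather than this small sample, the same rule would be expected to return the outlier count reported above.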
Log Theatre Days
This is a log transformation of the theatre days variable.
Table 45: Log number of theatre days summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 617 | 0 | 7 | 8.6 | 9.3 | 11.9 | 4.9 | 13.6 | 0 | 2.53 | 27.2 | -1.16 | 0.18 |
Figure 78: Log number of theatre days histogram and QQ Plot
Figure 79: Log number of theatre days boxplot
Central Tendency: Per the summary statistics (Table 45), the median and mean of the log days in theatre were 8.6 and 9.3 log days, respectively.
Dispersion: The standard deviation, s = 2.53, corresponds with a coefficient of variation of 27.2%, indicating a moderate degree of dispersion.
Shape of Distribution: The sample skewness (0.18) indicated that the distribution of log days in theatre was approximately symmetric. The sample kurtosis (-1.16) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 78 reveal a distribution that departs significantly from normality.
Outliers: The boxplot in Figure 79, which depicts the median, the IQR, and the minimum and maximum values, suggested that there were no outliers. The first quartile, third quartile, and IQR were 7, 11.9, and 4.9, respectively. This yielded a 1.5×IQR ‘acceptable’ range of [0, 19.25], with the lower bound floored at zero. All observations fell within this range, confirming the absence of outliers.